Webpage: https://v993.github.io/Representative-Polarity-US-House/

Problem Overview:¶
U.S. politics has become increasingly polarized and complicated in recent decades, making polarization a major focal point of today's political science. For many, this is cause for concern, and not only because of its impact on Thanksgiving dinners. Pernicious polarization has been linked to democratic erosion (Carnegie), and with the U.S. polarizing faster than other democracies (Brown), we as constituents have a responsibility to question our representatives' motives and whether they have our best interests at heart. So what are our responsibilities? Many of us understand the importance of voting, but how can we keep track of the decisions our representatives make for us, and decide whether to vote for them again down the line? As important as voting is in the U.S., many of us have better things to do than endlessly scrutinize the representative we voted for because they were blue or red in that election that felt like yesterday, and yet is somehow happening again, urging you out of your house and ruining your Tuesday afternoon.
Quantification of a representative on an ideological plane allows us to summarize who a representative is without studying their voting patterns or past policies. We can tell a lot about a representative by comparing them to their peers and contextualizing their ideology in relation to others. It would be a lot easier to say that a person is quantifiably -0.7 on an ideological scale compared to their colleagues than it would be to summarize the legislative decisions they have made.
Furthermore, it is by no means a new idea to place politicians in multidimensional space in order to understand their political ideology. Famously, Poole and Rosenthal estimate spatial coordinates for representatives using their political choices: votes*. Poole and Rosenthal's DW-NOMINATE system (Dynamic Weighted Nominal Three-step Estimation) places legislators on a two-dimensional map according to how similar their voting records are and, theoretically, their political ideologies. This also means that the positions we can compute in this space are confined to those politicians we already know through voting records: new representatives who have never voted obviously cannot have a NOMINATE score.
Understanding a candidate's place in the context of their fellow representatives could be a useful tool for understanding where politicians lie (no pun intended), and how they can be quantified, prior to election. In this project, I hope to use data on political representatives and the districts they represent to quantify their political ideologies and predict NOMINATE scores without using voting records. The main goal of this project is to make large-scale politics more digestible. I've done some work on this in the past (New York Data Project), but never in a learning capacity (i.e., machine learning). My work will be centered on this website. I will focus specifically on House representatives: there are more representatives to train on, and I believe we will encounter more variance in the lower chamber of Congress.
The central questions which will guide this study are the following:
- Are data on representatives alone adequate to predict NOMINATE scores?
- What features of a representative have the largest impact on their ideology?
- Can NOMINATE scores be used to predict features of a representative (e.g. age, aspects of constituency, finances)?
Citations:
- Carnegie, “What Happens When Democracies Become Perniciously Polarized?”
- Brown, “U.S. is Polarizing Faster Than Other Democracies”
- *Poole, Keith T., and Howard Rosenthal. “A Spatial Model for Legislative Roll Call Analysis.” American Journal of Political Science, vol. 29, no. 2, 1985, pp. 357–84. JSTOR
# load pretty jupyter
%load_ext pretty_jupyter
%%capture
%pip install numpy pandas thefuzz matplotlib
import re
import numpy as np
import pandas as pd
from thefuzz import fuzz
from math import floor, ceil
import matplotlib.pyplot as plt
import fresh_data.get_datasets
import importlib
importlib.reload(fresh_data.get_datasets) # reload get_datasets every time this cell is run
from fresh_data.get_datasets import *
# plt.rcParams['axes.grid'] = True # Universal grids for plots
plt.rcParams.update({'font.size': 22}) # Universal font size for plots
# Set facecolor for plots, best for exporting
plt.rcParams['axes.facecolor']='white'
plt.rcParams['savefig.facecolor']='white'
Data/Resources:¶
In order to understand a representative's ideology without their voting records, we're going to need some good substitutes. To cover our bases, I've pulled data on each representative's constituents, and as much data on the representatives themselves as I could find.
Representative Data:
The best sources of data on representatives that aren't vote-oriented are financial, as all representatives are required by law to report their campaign and personal finances to the FEC. Two important sources here are OpenSecrets, a nonpartisan organization that tracks money in politics, and the FEC, from which some of the OpenSecrets data is collected.
- VoteView DW-NOMINATE scores for members of the U.S. House
- OpenSecrets data on lobbying, campaign finance, and personal finances for congressional representatives
- FEC campaign finance data for congressional representatives
State Demographics:
A primary concern (at least, it should be a primary concern) for representatives is their constituents. Constituents decide who holds public office, and this often attracts a particular kind of person to represent a particular district. Using demographic data on a representative's constituents can give us vital insight into the ideology of the representative. Furthermore, state data is easier to find than data on representatives, and there are a fair number of sources here, not all of which are utilized in this study. We need the most comprehensive representation of a member's constituency possible, so it would be ideal to collect data at the district level (the unit House members represent). Sadly, good sources of district-level data are sparse. For the sake of this study, we will extrapolate state demographic information across a state's representatives. This approach is not ideal for reasons we will see later in our data exploration, but we will need to make do.
- Pew Research Center religious populations in each state, and questions from the census on belief in god
- US Census decennial population and geodata per state
- KFF state demographics data including race and poverty statistics
- SAIPE (Small Area Income and Poverty Estimates) data, built on IRS records
The range of data collected can hopefully give us a good idea of who the representative is, and we can filter down to more impactful features down the road if desired.
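As a minimal sketch of the extrapolation step described above (the frames and column names below are hypothetical stand-ins, not the project's actual schema), broadcasting state-level figures across every member of a state's delegation is a simple left merge:

```python
import pandas as pd

# Hypothetical representative-level and state-level frames:
reps = pd.DataFrame({
    "representative": ["A", "B", "C"],
    "state_name": ["Alabama", "Alabama", "Vermont"],
})
state_demo = pd.DataFrame({
    "state_name": ["Alabama", "Vermont"],
    "total_poverty": [0.16, 0.10],
})

# Every representative from the same state receives identical state-level values:
merged = reps.merge(state_demo, on="state_name", how="left")
print(merged)
```

Note that both Alabama representatives end up with the same total_poverty value, which is exactly the limitation acknowledged above: district-level variation within a state is lost.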
Extract, Transform, Load (ETL)¶
These sources have been neatly compiled into the following call to my data package. For more details, please see the data wrangling scripts in the GitHub for this study, linked here. The GitHub dives into the wrangling steps I took to form this dataset. The sources linked above are diverse, and loading them into my DataFrame took a combination of web scraping, file downloading en masse, and API calls. Entity resolution, not to mention data discrepancies, was no simple task. I'd estimate that 20% of the work for this project is what you see in this notebook; the remainder was entirely ETL.
The get_df() function call you see below performs all data extraction, transformation, and loading for the utilized datasets. Due to the nature of my fuzzy matching algorithm, the runtime of this operation is around 1.5 minutes. Because of the duration of this operation, I have elected to save this file to "full_df.csv" (available in my Github). If desired, one could uncomment the ETL line of code below and use a freshly generated source, but for the purposes of this project, using the .csv is sufficient.
# ETL code present in data.py, which can be freshly generated using this code:
import data
import importlib
importlib.reload(data) # reload get_df every time this cell is run
from data import get_df
# df = get_df() # Get full DF (takes approximately 1 minute with current merging strategy):
# df.to_csv("full_df.csv", index=False)
# Instead of generating new data, we can use pre-generated data located in 'full_df.csv':
df = pd.read_csv("full_df.csv")
df.head(3)
| representative | state_name | district_code | party | congress | year_range | born | age | nominate_dim1 | nominate_dim2 | ... | hindu | historically_black_protestant | jehovahs_witness | jewish | mainline_protestant | mormon | muslim | orthodox_christian | unaffiliated_religious_nones | population | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | DICKINSON, William Louis | Alabama | 2 | Republican Party | 101 | 1989-1991 | 1925.0 | 83.0 | 0.398 | -0.057 | ... | 0.01 | 0.04 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 4447100.0 |
| 1 | BEVILL, Tom | Alabama | 4 | Democratic Party | 101 | 1989-1991 | 1921.0 | 84.0 | -0.213 | 0.976 | ... | 0.01 | 0.04 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 4447100.0 |
| 2 | NICHOLS, William Flynt | Alabama | 3 | Republican Party | 101 | 1989-1991 | 1918.0 | 70.0 | -0.042 | 0.872 | ... | 0.01 | 0.04 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 4447100.0 |
3 rows × 52 columns
Interesting Data Inconsistencies¶
Sander Levin is a representative from Michigan who at two points in his career represented two different districts in Michigan. From 1983-1993, he represented the 17th district, and did not receive any contributions (at least on file with the FEC). In 1993 he retired from the 17th district to campaign for and win the 12th district House seat for Michigan. This seat would later be redistricted to the 9th district in 2012. In 2017, he announced he would not run for reelection in 2018. Instead, his son, Andy Levin, became his successor as the representative for the 9th district.
Alaska, Wyoming, Montana, North Dakota, South Dakota, Vermont, and Delaware all have only one seat in the US House of Representatives. Different databases encode this single at-large district differently, as either 01 or 00. These needed to be manually recoded.
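A minimal sketch of that recode (the helper below is hypothetical; the at-large states are the seven listed above, and 0 is chosen as the canonical code):

```python
import pandas as pd

# States with a single, at-large House seat (encoded inconsistently as 01 or 00):
AT_LARGE_STATES = {"Alaska", "Wyoming", "Montana", "North Dakota",
                   "South Dakota", "Vermont", "Delaware"}

def normalize_district(row):
    # Collapse the 01/00 ambiguity: at-large seats always become district 0.
    return 0 if row["state_name"] in AT_LARGE_STATES else row["district_code"]

sample = pd.DataFrame({
    "state_name": ["Alaska", "Vermont", "Alabama"],
    "district_code": [1, 0, 2],
})
sample["district_code"] = sample.apply(normalize_district, axis=1)
print(sample["district_code"].tolist())  # [0, 0, 2]
```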
- Redistricting:
- Ron Barber is a representative from Arizona who represented the 2nd district. He took office in 2012, when the district was still numbered as the 8th; it was redistricted in 2013. VoteView's polarization data notes this under the 8th district in the 112th session of congress (2011-2013), while the FEC records his campaign finance information for 2011-2012 under the 2nd district. This requires manual reconciliation.
- Kathy Hochul is a representative from New York who represented the 26th district. She took office in 2011 and lost the election in 2013 when the district was redistricted into the 27th district. VoteView's polarization data notes this under the 26th district, yet the FEC records her campaign finance information for 2011-2012 under the 27th district, which is inaccurate. This requires manual reconciliation.
- Conor Lamb is a representative and attorney from Pennsylvania who represented the 17th and 18th districts. Conor took office in 2018 representing the 17th district after his predecessor, a pro-life Republican, resigned following a scandal involving his desire for a mistress to have an abortion. Following his election into office, the 2019 redistricting made the area far more Republican. Conor Lamb, a Democrat, ran for the 18th district following the redistricting, giving up his seat in the 17th district in order to win the election for the 18th.
- Data discrepancies:
- Year encodings make matching across these datasets wildly difficult. The most confusing part of researching this part of my project was understanding electoral cycles (despite being a political science major in undergrad). I quickly realized that trying to match based on year would not work out for several reasons. Not only are electoral cycles measured slightly differently across sources, but financial records for campaigns often encompass spending in the years leading up to an election and after. This resulted in the current implementation, matching based on name, district, and state, which is time-consuming and inefficient.
- The FEC reports finances for candidates, while OpenSecrets and our polarization DB both include information for representatives. The FEC includes individuals who ran for office, even if they didn't hold office.
- The FEC includes finances for representatives of non-voting entities like Guam or the Virgin Islands, which are part of the United States but have no voting representation in Congress. These have to be excluded from our study.
- For whatever reason, the following representatives do not have campaign finance information for some of their campaigns on file with the FEC. The cause could be incorrect parsing (though the number of times I have checked begs to differ), or perhaps very, very cheap campaigns. Some of these could be incumbents who didn't spend money on campaigns or didn't have competition. Here is the full list:
- LaTOURETTE, Steven C: [106, 107]
- FORBES, J. Randy: [107]
- ISAKSON, Johnny [107]
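The name/district/state matching described under "Data discrepancies" above can be sketched roughly as follows. This is a simplified, hypothetical matcher: the notebook's real ETL uses thefuzz, but the standard library's difflib illustrates the same idea of scoring name similarity after requiring an exact state/district match.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    # Case-insensitive similarity between two "LAST, First" strings.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_match(target, candidates, threshold=0.85):
    """Return the candidate whose name best matches `target`, requiring an
    exact state/district match; None if nothing clears `threshold`."""
    best, best_score = None, threshold
    for cand in candidates:
        if (cand["state_name"], cand["district_code"]) != (
                target["state_name"], target["district_code"]):
            continue
        score = name_similarity(target["representative"], cand["representative"])
        if score >= best_score:
            best, best_score = cand, score
    return best

# Hypothetical rows from two sources with slightly different name formats:
fec_rows = [
    {"representative": "Levin, Sander M.", "state_name": "Michigan", "district_code": 12},
    {"representative": "Smith, Jane", "state_name": "Michigan", "district_code": 12},
]
voteview_row = {"representative": "LEVIN, Sander", "state_name": "Michigan", "district_code": 12}
match = best_match(voteview_row, fec_rows)
print(match["representative"])  # Levin, Sander M.
```

Scoring every candidate pair like this is quadratic in the worst case, which is why the full merge described later takes on the order of minutes.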
Exploratory Data Analysis (EDA):¶
A Brief note on DW-NOMINATE¶
Per Poole and Rosenthal:
- The first dimension picks up differences in ideology, represented through the "liberal" vs. "conservative" (also referred to as "left" vs. "right") divide throughout American history. Negative denotes a liberal disposition, positive a conservative one.
- The second dimension picks up differences within the major political parties over slavery, currency, nativism, civil rights, and lifestyle issues during periods of American history.
For most purposes, the second dimension isn't as relevant. For the purposes of this study, we will focus on the first dimension. Using the 101st session from 1989 and our most recent 116th session of congress, we can see a little of what political science frequently (and unnecessarily) reminds us: America is becoming more polarized. The difference is pretty striking visually:
congress116 = df[df["congress"]==116]
congress101 = df[df["congress"]==101]
fig, (ax1, ax2) = plt.subplots(nrows=1,ncols=2,figsize=(18,8))
ax1.set_xlabel("nominate_dim1")
ax1.set_ylabel("Frequency")
ax1.hist(congress101["nominate_dim1"], alpha=0.5, label="101st session")
ax1.hist(congress116["nominate_dim1"], alpha=0.5, label="116th session")
ax1.legend(loc='upper right')
ax1.title.set_text("NOMINATE 1st Dim")
ax1.grid()
all_years = df["year_range"].unique()
year_mask = [df[df["year_range"]==year]["nominate_dim1"].abs().mean() for year in all_years]
all_years = [int(year[:4]) for year in all_years]
ax2.plot(all_years,year_mask)
ax2.set_xlabel("Year")
ax2.set_ylabel("abs nominate_dim1")
ax2.title.set_text("Average Abs Polarity by Year")
ax2.grid()
fig.tight_layout()
# fig.savefig(f"nominate_scores_over_time.png", dpi=96)
This basic extraction shows us a pretty striking relationship. The histogram on the left shows the trend away from a uniform distribution, with both peaks (both ideological centers) becoming tighter and tighter clusters away from 0.0 (the ideological moderate). The line plot on the right shows the average absolute ideological polarity (distance from center, on either side of the aisle), which also trends away from 0.0.
Age in the House¶
Before diving into ideological scores, let's investigate the age values in congress and see what our demographics are like. Considering the stereotypical polarity of older generations in the U.S., examining the age distribution in congress could be useful.
print("Average age between 1989-2020:", df["age"].mean())
print("Median age between 1989-2020:", df["age"].median())
print("Max age between 1989-2020:", df["age"].max())
Average age between 1989-2020: 58.08426095533324
Median age between 1989-2020: 56.0
Max age between 1989-2020: 101.0
Definitely not a hip crowd by any metric. I'm not sure how a 101-year-old held public office within the last three decades. In any case, we have a diversely aged assortment of representatives here, as the following boxplot and histogram demonstrate. I'd argue the distribution leans a little too far to the older side.
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(15,8))
df["age"].plot.box(
color=dict(medians='r'),
widths=0.5,
boxprops=dict(linestyle='-', linewidth=2),
flierprops=dict(linestyle='-', linewidth=2),
medianprops=dict(linestyle='-', linewidth=2.5),
whiskerprops=dict(linestyle='-', linewidth=2),
capprops=dict(linestyle='-', linewidth=2),
grid=True,
ax=ax1)
ax1.set_ylabel("age")
ax1.title.set_text("Age Boxplot")
ax2.hist(df["age"],color="orange",alpha=0.8)
ax2.set_ylabel("Count")
ax2.set_xlabel("age")
ax2.title.set_text("Age Histogram")
ax2.grid()
fig.tight_layout()
The results of these plots are a little scary, regardless of the idiom that "with age comes wisdom." Considering how difficult it has been to find data on congressional representatives, it will be interesting to see how age impacts a representative's polarity, especially with such a spread of data. The histogram on the right shows a right skew, and that tail is far longer than I think it should be.
New York Case Study¶
To illustrate the purpose of ideological scores and dive a little deeper into our representative data, let's visit New York. New York is a heavily liberal state, ideologically speaking, mostly due to New York City. But upstate there are a good number of conservative voters, and we can see this reflected in the ideological plot below.
ny_116 = df.loc[(df["congress"] == 116) & (df["state_name"] == "New York")].groupby(["representative"], as_index=False)[["congress","nominate_dim1", "nominate_dim2", "party"]].agg(
nominate_dim1_mean=('nominate_dim1', 'mean'),
nominate_dim2_mean=('nominate_dim2', 'mean'),
congress=('congress', 'first'),
party=('party', 'first')
).sort_values("congress")
colors = {
"Republican Party": "red",
"Democratic Party": "blue"
}
party = {
"Republican Party": "R",
"Democratic Party": "D"
}
fig, ax = plt.subplots(figsize=(15, 8))
ax.scatter(x=ny_116['nominate_dim1_mean'], y=ny_116['nominate_dim2_mean'], c=ny_116['party'].map(colors), s=100)
ax.set_ylabel("nominate_dim2_mean")
ax.set_xlabel("nominate_dim1_mean")
delta = 0.005
for idx, row in ny_116.iterrows():
ax.annotate(row['representative'], (row['nominate_dim1_mean']+delta, row['nominate_dim2_mean']+delta), fontsize=12)
fig.tight_layout()
# fig.savefig(f"ny_nominate_scores.png", dpi=96)
With this plot, the idea of NOMINATE scores becomes pretty obvious. There are definite clusters by party in both dimensions, with the NOMINATE 1st dimension denoting the liberal-conservative axis. As demonstrated by the graph, negative is liberal and positive is conservative. As mentioned before, we will be focusing exclusively on the 1st dimension.
There is another significant observation to be made here: political ideology is heavily polarized within a state. This means that our plan to use state demographics will fail to account for the diversity of political ideologies within a state at the district level. This could result in some complications in our model, but it is a concession we will have to make due to lack of data.
State Observations¶
We've seen that political ideology has significant variance within a state. But how do those patterns show up on a national level? To investigate, we will use choropleths utilizing state aggregates to understand the trends we have already observed, and draw conclusions about what a state says about a particular representative.
from mpl_toolkits.axes_grid1 import make_axes_locatable
import matplotlib as mpl
import geopandas
# Get geographical data for states from local geodata file:
states_geodata = geopandas.read_file('fresh_data/geodata/usa-states-census-2014.shp')
def build_choropleth(column="nominate_dim1", cmap="RdBu_r", all_time=False, manual_table=False, table=None, halfrange=None):
"""
Builds choropleth according to specifications.
By default, generates polarity breakdown by state for 116th session of congress.
Params:
- column: the column being aggregated on a state basis
- cmap: cmap to be used for state coloration
- all_time: flag denoting whether chart concerns all time, or a particular time period
- manual_table: (sloppy) flag denoting intention to use the passed 'table' parameter instead of defaulting to the global 'df' table
- table: (sloppy) manual table to be used in place of global 'df' table
- halfrange: param denoting the cmap centering norm to be used
"""
# Get subset of states for 116th congress from global 'df' when no table is provided:
if not manual_table:
subset_116 = df[df["congress"]==116].groupby(["state_name"], as_index=False)[column].mean().sort_values(by=column)
table = pd.merge(
subset_116,
states_geodata,
left_on="state_name",
right_on="NAME",
how="left"
)
else:
table = pd.merge(
table,
states_geodata,
left_on="state_name",
right_on="NAME",
how="left"
)
# Build GeoDataFrame from current df:
geo_df = geopandas.GeoDataFrame(table, geometry=table["geometry"])
# Remove Hawaii and Alaska for whom we do not have geodata:
hawaii_alaska_indices = geo_df[(geo_df["state_name"] == "Hawaii") | (geo_df["state_name"] == "Alaska")].index
geo_df = geo_df.drop(hawaii_alaska_indices)
# Normalize cmap from data:
if halfrange:
norm = mpl.colors.CenteredNorm(halfrange=halfrange)
else:
norm = None
# Plot
fig = plt.figure(1, figsize=(25,15));
ax = fig.add_subplot();
# Legend tweaks:
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="5%", pad=-1.8)
ax.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
ax.tick_params(axis='y', which='both', left=False, right=False, labelleft=False)
ax.set_title(f"Average {column} by State{', 2019-2021' if not all_time else ''}")
ax.set_frame_on(False)
geo_df.apply(lambda x: ax.annotate(text=x.NAME, xy=x.geometry.centroid.coords[0], ha='center', fontsize=14),axis=1);
us_map = geo_df.plot(
ax=ax,
cmap=cmap,
norm=norm,
figsize=(15,15),
column=column,
legend=True,
legend_kwds={"label": f"Mean {column}", "orientation": "vertical"},
cax=cax,
);
return us_map.get_figure()
Average NOMINATE by State¶
To examine the breakdown of ideology on a state level, we'll focus on the 116th session of congress. Averaging all of the NOMINATE scores for this session on a state level allows us to color-code our choropleth appropriately, with a blue-red color mapping (blue being liberal, red being conservative):
fig = build_choropleth(halfrange=0.5)
fig.tight_layout()
This plot demonstrates a pretty obvious conclusion for most Americans: red states are conservative and blue states are liberal. With that, we can already make assumptions about our predictive power on a state level: liberals will be predicted fairly well in blue states, and conservatives will be predicted well in red states. Unfortunately, this also means that we will likely suffer predictively when a representative's ideology doesn't line up with their state's average scores.
Let's dive deeper into these metrics.
Signed Democratic Party Change¶
To examine the impact of party, we will estimate the signed difference between the average NOMINATE scores of the 1989 (101st) and 2019 (116th) congressional sessions. Negative average changes denote a liberal polarity shift, and positive average changes denote a conservative polarity shift.
df["nominate_dim1_difference"] = df.groupby("state_name")["nominate_dim1"].transform(lambda x: np.ptp(x))
def get_party_map(party):
print(f"Loading data for {party}:")
party_df = df[df['party'] == party]
state_ideology_change = pd.DataFrame(columns=["state_name", f"{party} Ideology Change 1989-2021"])
for state in party_df["state_name"].unique():
state_df = party_df[party_df["state_name"]==state][["congress", "nominate_dim1"]]
congresses_avg_dim1 = state_df.groupby("congress", as_index=False)["nominate_dim1"].mean()
congress_1 = congresses_avg_dim1[congresses_avg_dim1["congress"]==congresses_avg_dim1["congress"].min()]["nominate_dim1"].mean()
congress_2 = congresses_avg_dim1[congresses_avg_dim1["congress"]==congresses_avg_dim1["congress"].max()]["nominate_dim1"].mean()
        signed_difference = congress_2 - congress_1  # negative = liberal shift, positive = conservative shift
d={'state_name': state, f"{party} Ideology Change 1989-2021": signed_difference}
state_ideology_change.loc[len(state_ideology_change)]=d
fig = (build_choropleth(column=f"{party} Ideology Change 1989-2021", cmap="RdBu_r", all_time=True, manual_table=True, table=state_ideology_change, halfrange=1.0))
return fig
fig = get_party_map("Democratic Party")
fig.tight_layout()
fig.savefig(f"democrat_ideology_difference.png", dpi=96)
Loading data for Democratic Party:
The graph gives us a little insight into the polarity we examined in our first few data explorations. As this graph only includes Democrats, we can see that Democratic representatives tend to get more liberal, with the exceptions being red states from our first choropleth. This tells us that party lines alone do not rule ideological shifts; Democrats get more liberal or more conservative based on their home state. This means that regardless of political affiliation, representatives from red states are more conservative, and representatives from blue states are more liberal. It also tells us that these states get increasingly entrenched in their dominant ideology, another indicator of increasing polarity.
Signed Republican Party Change¶
Computing the same for the Republican Party, we can see the same relationship:
fig = get_party_map("Republican Party");
fig.tight_layout()
fig.savefig(f"republican_ideology_difference.png", dpi=96)
Loading data for Republican Party:
We can observe the same relationship for Republicans, with a serious shift towards conservatism in South Dakota.
Correlation Heatmap¶
To evaluate our constituent data, let's focus on a few key features of each state's population and see how they correlate with our target, nominate_dim1. To do so, we'll use a correlation heatmap to visually depict relationships within our data. Ideally, we will be able to identify some patterns and make some decisions on feature selection. This enhances our understanding of variable interactions and gives us an idea of what direction we should take the model.
import seaborn as sns
import matplotlib
matplotlib.rcParams.update({'font.size': 12})
correlation_specifics = df[["nominate_dim1", "believe_in_god_absolutely_certain", "do_not_believe_in_god", "other_dont_know_if_they_believe_in_god", \
"white", "black", "contributions_from_pacs", "contributions_from_individuals", "cash_on_hand", "debts", "total_poverty", "age", "receipts"]].corr()
fig, ax = plt.subplots(figsize=(15,10))
ax = sns.heatmap(correlation_specifics, vmin=-1, vmax=1, annot=True)
# Give a title to the heatmap. Pad defines the distance of the title from the top of the heatmap.
ax.set_title('Correlation Heatmap', fontdict={'fontsize':12}, pad=12);
matplotlib.rcParams.update({'font.size': 22})
There are no strong correlations, but hopefully our use of these metrics in tandem will aid in our modeling efforts. Some standout results:
- believe_in_god_absolutely_certain has the highest positive correlation with our NOMINATE scores; we can assume this could help us in modeling
- personal finance data has very small correlations with our NOMINATE scores, so our predictions will likely suffer on a district level, as anticipated
- age has a relatively high correlation with NOMINATE score; this could also be helpful
Single-Feature Regression¶
Using our highest correlation from the heatmap above, we will build a quick predictor to estimate nominate_dim1 without any other features:
from sklearn.linear_model import LinearRegression
# df.plot.scatter(x="believe_in_god_absolutely_certain", y="nominate_dim1")
holy_df = df.groupby("believe_in_god_absolutely_certain", as_index=False)["nominate_dim1"].mean()
fig, ax = plt.subplots(figsize=(15,8))
ax.grid()
ax.scatter(x=holy_df["believe_in_god_absolutely_certain"], y=holy_df["nominate_dim1"])
ax.set_ylabel("nominate_dim1")
ax.set_xlabel("believe_in_god_absolutely_certain")
ax.set_title("Simple Regression on 'believe_in_god' Feature")
lr = LinearRegression()
lr.fit(np.array(holy_df["believe_in_god_absolutely_certain"]).reshape(-1,1), np.array(holy_df["nominate_dim1"]).reshape(-1,1))
# ax.plot()
regression = [lr.predict(np.array([belief]).reshape(-1,1)) for belief in holy_df["believe_in_god_absolutely_certain"].unique()]
regression = np.array(regression).reshape(-1,1)
ax.plot(holy_df["believe_in_god_absolutely_certain"].unique(), regression, color="red", linewidth=3);
fig.tight_layout()
# fig.savefig(f"holy_regression_batman.png", dpi=96)
Machine Learning:¶
Data Preparation¶
We have observed some correlations and possible predictive power in our EDA, but for these first few predictive attempts, we'll be using most of our data. For the most part, the only excluded fields are metadata from merges, and unrelated fields like the second nominate dimension.
Additionally, some final cleaning is required to get our data into the appropriate shape for model use.
Our models will aim to address two predictive tasks:
- Predicting DW-NOMINATE Scores
- Nationally
- State Level
- Predicting Representative Traits
- Age
- Constituent Belief in God
- State/District
- Debts
We will then take what our best performing models learn and extrapolate further patterns in the data.
Refactor Party¶
We are mostly interested in Republican/Democrat representatives, so we can encode party as an indicator where Republican is 0 and Democrat is 1:
# Remove smaller parties or lack of party affiliation:
df = df[(df["party"] == "Republican Party") | (df["party"] == "Democratic Party")]
df["party"] = df["party"].apply(lambda x: 1 if "Democratic" in x else 0)
Refactor Districts¶
Our districts are integer values which, from a top-down perspective, could appear to have relationships with each other. This can impede model learning: California's 5th district has nothing to do with Wisconsin's 5th district. Districts must be understood within the context of their state, and this information is not presently captured by our data. To properly learn from districts, we will need to combine our state and district columns, and only then can we one-hot encode. This will increase the dimensionality of our data significantly, but I believe it will aid in model learning.
df["district_code"] = df["state_name"].apply(lambda x: x.lower()) + "_district_" + df["district_code"].astype(str)
Reshaping/Refactoring Data¶
We will use a functional implementation so that we can tweak what our model learns from down the line, and further investigate feature importance as well as necessity. The functional implementation accomplishes the following automatically:
- Refactor Int/Float: TensorFlow requires float32, which requires a minor refactor you can see in the code below.
- Categorical One-Hot-Encoding: In order for our models to appropriately interpret and utilize our categorical features, we will need to one-hot encode each of them.
- Train/Test Split: I opted for a 90-10 train/test split somewhat arbitrarily, as our dataset is large enough that a 10% test set seems adequate for our purposes.
from sklearn.model_selection import train_test_split
# Default columns for learning (most of them):
columns_X = [
## representative data:
'state_name', 'district_code', 'party',
'congress', 'year_range', 'born', 'age',
'nominate_number_of_votes', 'running_as',
'receipts', 'contributions_from_individuals', 'contributions_from_pacs',
'contributions_and_loans_from_candidate', 'disbursements',
'cash_on_hand', 'debts',
## state data:
'total_poverty', 'white', 'black', 'hispanic', 'asian', 'multiple_races',
# State data across years (only most recent data available):
'believe_in_god_absolutely_certain', 'believe_in_god_fairly_certain',
'believe_in_god_not_too_not_at_all_certain', 'believe_in_god_dont_know',
'do_not_believe_in_god',
'buddhist', 'catholic', 'evangelical_protestant', 'hindu',
'historically_black_protestant', 'jehovahs_witness', 'jewish',
'mainline_protestant', 'mormon', 'muslim', 'orthodox_christian',
'unaffiliated_religious_nones',
# State data from decennial censuses (2020, 2010, 2000) as closest-match
'population'
]
def prepare_data(columns_X=columns_X, columns_Y=['nominate_dim1', 'nominate_dim2']):
    y = df[columns_Y[0]]  # Only target the first dimension
    X = df[columns_X].copy()
    # Treat integer-coded columns as categorical:
    int_cols = list(X.select_dtypes(include=[int]))
    X[int_cols] = X[int_cols].astype("category")
    # TensorFlow requires float32:
    float64_cols = list(X.select_dtypes(include='float64'))
    X[float64_cols] = X[float64_cols].astype('float32')
    y = y.astype('float32')
    # One-hot encode all object/categorical columns:
    cat_cols = X.select_dtypes(include=[object, "category"]).columns
    X = pd.get_dummies(X, prefix=cat_cols, columns=cat_cols)
    return train_test_split(X, y, test_size=0.1)
x_train, x_test, y_train, y_test = prepare_data()
Predicting DW-NOMINATE Nationally¶
Model Selection¶
My approach to model selection is to do some surface-level fitting on the training data with an assortment of models. I will select the highest default performers and tune their hyperparameters with grid searches to find the optimal model for this problem. We will repeat this process for each of our predictive tasks.
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error
def surface_level_test(model):
model.fit(x_train,y_train)
return round(model.score(x_train, y_train), 2), round(model.score(x_test, y_test), 2)
models = [
("LR", LinearRegression()),
("KNR", KNeighborsRegressor()),
("MLP", MLPRegressor()),
("ABR", AdaBoostRegressor()),
("GBR", GradientBoostingRegressor()),
("RFR", RandomForestRegressor())
]
for name, model in models:
train, test = surface_level_test(model)
print(f"- {name}: test: {test}")
- LR: test: 0.85
- KNR: test: 0.07
- MLP: test: -191208490.41
- ABR: test: 0.81
- GBR: test: 0.86
- RFR: test: 0.9
Surface-level training of these algorithms yielded the following test scores (from an earlier run; exact values vary between runs):
- LinearRegression: 0.79
- TF-DNN (using MAE): 0.40 (not pictured above)
- KNNRegression: 0.13
- SKL-MLP: -2.3828207429 × 10^8 (the MLP likely diverged on unscaled inputs)
- AdaBoost: 0.89
- GradientBoosting: 0.86
- RandomForestRegression: 0.91
Considering the number of tests made, I opted to focus on my best-performing model and remove the unused models from this notebook.
Understanding scores:¶
RandomForestRegressor scores are the "coefficient of determination" (R²), defined as one minus the residual sum of squares divided by the total sum of squares. 1.0 is the best possible score, but scores can be negative (the model can be arbitrarily worse). The score is easily interpretable: it gives the fraction of the target's variance explained by the model.
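The coefficient-of-determination definition can be checked directly against scikit-learn's `r2_score` on toy arrays (the values below are illustrative, not from the dataset):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([0.5, -0.2, 0.7, -0.4])  # toy NOMINATE-like values
y_pred = np.array([0.4, -0.1, 0.6, -0.5])

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Matches sklearn's implementation:
assert np.isclose(r2_manual, r2_score(y_true, y_pred))
```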
Grid Search¶
In order to tune the hyperparameters of our chosen model, I opted to use GridSearchCV, which searches the specified hyperparameter combinations for the best-performing fit on the data. The function below allows a custom estimator and parameter grid to be supplied.
Additionally, the search uses 5-fold cross validation for each candidate. I stuck with the default of 5 folds to balance training time against making full use of my training data.
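The 5-fold behaviour can be seen in isolation with `cross_val_score` (the synthetic regression below is illustrative, not the project's data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -0.5, 0.2]) + rng.normal(scale=0.1, size=100)

# cv=5 fits the model five times, each time holding out a different 20% fold,
# which is exactly what GridSearchCV does per hyperparameter candidate:
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(len(scores))  # -> 5
```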
from sklearn.model_selection import GridSearchCV

def grid_search(estimator, params):
    grid = GridSearchCV(estimator=estimator, param_grid=params, cv=5, n_jobs=-1, verbose=10)
    grid.fit(x_train, y_train)
    print("\n Results from Grid Search ")
    print("\n The best estimator across all searched params:\n", grid.best_estimator_)
    print("\n The best score across all searched params:\n", grid.best_score_)
    print("\n The best parameters across all searched params:\n", grid.best_params_)
    return grid
Benchmarking¶
Below is a benchmarking function I wrote to quickly summarize the default scoring of a model and output the top feature importances. This will be handy when we run different tests on models and want to understand how they are progressing.
# Outputs relevant information about a model including default score and feature importances as a dataframe:
def benchmark_model(model):
    # Tree ensembles expose feature_importances_; linear models expose coef_:
    try:
        feature_importances = model.feature_importances_
    except AttributeError:
        feature_importances = model.coef_
feature_importances = pd.DataFrame({'importance': feature_importances}, index=x_train.columns).sort_values(by='importance', ascending=False)
feature_importances["importance"] = feature_importances["importance"].apply(round, args=(4,))
test_score = model.score(x_test, y_test)
train_score = model.score(x_train, y_train)
print(f"Train Score : { train_score }")
print(f"Test Score : { test_score }")
display(feature_importances.head(20))
return feature_importances.head(20)
RandomForestRegressor¶
Before tuning hyperparameters, let's take a preliminary look at feature importances, first from a LinearRegression baseline (whose coefficients serve as importances) and then from the default RandomForestRegressor:
lr = LinearRegression().fit(x_train, y_train)
benchmark_model(lr);
Train Score : 0.8829188371230571 Test Score : 0.8523750341200649
| feature | importance |
|---|---|
| multiple_races | 4.6133 |
| white | 0.9624 |
| black | 0.7884 |
| state_name_Hawaii | 0.6997 |
| state_name_California | 0.3729 |
| district_code_hawaii_district_1 | 0.3579 |
| district_code_hawaii_district_2 | 0.3418 |
| district_code_texas_district_36 | 0.3149 |
| state_name_New Mexico | 0.3001 |
| district_code_arizona_district_6 | 0.2916 |
| party_0 | 0.2915 |
| district_code_california_district_48 | 0.2850 |
| state_name_Texas | 0.2837 |
| state_name_Nevada | 0.2821 |
| state_name_Arizona | 0.2752 |
| district_code_wisconsin_district_9 | 0.2741 |
| district_code_texas_district_14 | 0.2689 |
| district_code_california_district_4 | 0.2683 |
| district_code_texas_district_3 | 0.2663 |
| district_code_ohio_district_8 | 0.2546 |
default_model = RandomForestRegressor(random_state=42).fit(x_train, y_train)
benchmark_model(default_model);
Train Score : 0.9877739224511929 Test Score : 0.9012254786181271
| feature | importance |
|---|---|
| party_0 | 0.4390 |
| party_1 | 0.3445 |
| running_as_Incumbent | 0.0273 |
| born | 0.0156 |
| disbursements | 0.0114 |
| age | 0.0100 |
| contributions_from_pacs | 0.0091 |
| contributions_from_individuals | 0.0084 |
| believe_in_god_absolutely_certain | 0.0075 |
| cash_on_hand | 0.0073 |
| black | 0.0066 |
| asian | 0.0062 |
| nominate_number_of_votes | 0.0061 |
| total_poverty | 0.0049 |
| multiple_races | 0.0046 |
| hispanic | 0.0046 |
| receipts | 0.0039 |
| believe_in_god_fairly_certain | 0.0037 |
| white | 0.0037 |
| do_not_believe_in_god | 0.0036 |
We can observe here that party is disproportionately important compared to the other top features. This makes sense from our EDA, where party is a major indicator of ideology, but I was expecting to see a heavier importance on location, where we have a large amount of historical data to demonstrate how a representative might lean. This suggests that the polarization of representatives at the district level prevents us from using state data effectively, as I had feared earlier. The model tries to make up the difference with representative data, which has a sparse correlation with our target feature, but ends up relying on party to determine a member's ideology.
These results are not bad: explaining ~90% of our test set's variance is still a success. The model does appear to be overfitting, but hyperparameter tuning and data modification may help alleviate this in later tests. The reliance on party should be limited; I am interested in how well we can predict without it. To see whether my assumptions about the model's behavior in the face of district-level polarization are accurate, let's run another test.
Training with State Data¶
Below is a subset of our data which includes exclusively state data. If my hypothesis is accurate, we will get very low accuracy, given our reliance on representative data to understand polarity within a state.
test_columns = [
'state_name', 'district_code',
## state data:
'total_poverty', 'white', 'black', 'hispanic', 'asian', 'multiple_races',
# State data across years (only most recent data available):
'believe_in_god_absolutely_certain', 'believe_in_god_fairly_certain',
'believe_in_god_not_too_not_at_all_certain', 'believe_in_god_dont_know',
'do_not_believe_in_god',
'buddhist', 'catholic', 'evangelical_protestant', 'hindu',
'historically_black_protestant', 'jehovahs_witness', 'jewish',
'mainline_protestant', 'mormon', 'muslim', 'orthodox_christian',
'unaffiliated_religious_nones',
# State data from decennial censuses (2020, 2010, 2000) as closest-match
'population'
]
x_train, x_test, y_train, y_test = prepare_data(columns_X=test_columns)
default_model = RandomForestRegressor(random_state=42).fit(x_train, y_train)
benchmark_model(default_model);
Train Score : 0.8384893234680975 Test Score : 0.6581597603841642
| feature | importance |
|---|---|
| believe_in_god_absolutely_certain | 0.0762 |
| asian | 0.0368 |
| hispanic | 0.0335 |
| multiple_races | 0.0323 |
| white | 0.0292 |
| total_poverty | 0.0288 |
| population | 0.0203 |
| do_not_believe_in_god | 0.0196 |
| black | 0.0192 |
| district_code_georgia_district_5 | 0.0099 |
| district_code_north carolina_district_12 | 0.0069 |
| believe_in_god_fairly_certain | 0.0069 |
| district_code_texas_district_18 | 0.0069 |
| district_code_texas_district_30 | 0.0066 |
| district_code_mississippi_district_2 | 0.0066 |
| district_code_louisiana_district_2 | 0.0064 |
| district_code_missouri_district_1 | 0.0064 |
| district_code_texas_district_20 | 0.0062 |
| district_code_ohio_district_11 | 0.0061 |
| district_code_georgia_district_4 | 0.0060 |
Evidently my hypothesis is accurate: without data on the representative, the model is crippled. Interestingly, the highest-correlation feature from our earlier EDA (believe_in_god_absolutely_certain) is the most important feature in this space.
Considering the entire feature space is important given the influence of constituents on their representatives. But without district-level fidelity, some of our state data does not appear to be aiding the modeling process. To evaluate this, we'll need another test:
Training with Representative Data¶
Below is a subset of our data which includes representative data alongside the state data. If my hypothesis is accurate, accuracy should recover once the model can again see representative-level features.
test_columns = [
## representative data:
'state_name', 'district_code', 'party',
'congress', 'year_range', 'born', 'age',
'nominate_number_of_votes', 'running_as',
'receipts', 'contributions_from_individuals', 'contributions_from_pacs',
'contributions_and_loans_from_candidate', 'disbursements',
'cash_on_hand', 'debts',
## state data:
'total_poverty', 'white', 'black', 'hispanic', 'asian', 'multiple_races',
# State data across years (only most recent data available):
'believe_in_god_absolutely_certain', 'believe_in_god_fairly_certain',
'believe_in_god_not_too_not_at_all_certain', 'believe_in_god_dont_know',
'do_not_believe_in_god',
'buddhist', 'catholic', 'evangelical_protestant', 'hindu',
'historically_black_protestant', 'jehovahs_witness', 'jewish',
'mainline_protestant', 'mormon', 'muslim', 'orthodox_christian',
'unaffiliated_religious_nones',
# State data from decennial censuses (2020, 2010, 2000) as closest-match
'population'
]
x_train, x_test, y_train, y_test = prepare_data(columns_X=test_columns)
default_model = RandomForestRegressor(random_state=42).fit(x_train, y_train)
benchmark_model(default_model);
from sklearn.ensemble import RandomForestRegressor
rfr_params = {
'n_estimators': [25, 50, 100, 150],
'max_features': ['sqrt', 'log2', None],
'max_depth': [3, 6, 9],
'max_leaf_nodes': [3, 6, 9],
}
grid_rfr = grid_search(RandomForestRegressor(), rfr_params)  # pass an instance, not the class
GradientBoostingRegressor¶
I have arbitrarily chosen the following hyperparameters to be tested for a GradientBoostingRegressor (GBR). As a warning, the following code takes around 25 minutes to run.
test_columns = [
'state_name',
## state data:
'total_poverty', 'white', 'black', 'hispanic', 'asian', 'multiple_races',
# State data across years (only most recent data available):
'believe_in_god_absolutely_certain', 'believe_in_god_fairly_certain',
'believe_in_god_not_too_not_at_all_certain', 'believe_in_god_dont_know',
'do_not_believe_in_god',
'buddhist', 'catholic', 'evangelical_protestant', 'hindu',
'historically_black_protestant', 'jehovahs_witness', 'jewish',
'mainline_protestant', 'mormon', 'muslim', 'orthodox_christian',
'unaffiliated_religious_nones',
# State data from decennial censuses (2020, 2010, 2000) as closest-match
'population'
]
x_train, x_test, y_train, y_test = prepare_data(columns_X=test_columns)
# Outputs relevant information about a model including default score and feature importances as a dataframe:
def benchmark_model(model):
feature_importances = pd.DataFrame({'importance': model.feature_importances_}, index=x_train.columns).sort_values(by='importance', ascending=False)
feature_importances["importance"] = feature_importances["importance"].apply(round, args=(4,))
test_score = model.score(x_test, y_test)
train_score = model.score(x_train, y_train)
print(f"Train Score : { train_score }")
print(f"Test Score : { test_score }")
display(feature_importances)
return feature_importances.head(20)
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(random_state=0)
model.fit(x_train, y_train)
benchmark_model(model);
Train Score : 0.9874107098657662
Test Score : 0.9104885032679291
| feature | importance |
|---|---|
| party | 0.7830 |
| running_as_Incumbent | 0.0275 |
| born | 0.0169 |
| contributions_from_pacs | 0.0116 |
| age | 0.0115 |
| ... | ... |
| state_name_Montana | 0.0000 |
| state_name_Hawaii | 0.0000 |
| state_name_North Dakota | 0.0000 |
| state_name_Vermont | 0.0000 |
| state_name_Alaska | 0.0000 |
173 rows × 1 columns
| feature | importance |
|---|---|
| party | 0.7830 |
| running_as_Incumbent | 0.0275 |
| born | 0.0169 |
| contributions_from_pacs | 0.0116 |
| age | 0.0115 |
| disbursements | 0.0105 |
| cash_on_hand | 0.0078 |
| contributions_from_individuals | 0.0078 |
| nominate_number_of_votes | 0.0068 |
| black | 0.0061 |
| total_poverty | 0.0060 |
| believe_in_god_absolutely_certain | 0.0060 |
| believe_in_god_fairly_certain | 0.0055 |
| multiple_races | 0.0054 |
| receipts | 0.0050 |
| asian | 0.0049 |
| hispanic | 0.0045 |
| contributions_and_loans_from_candidate | 0.0044 |
| debts | 0.0038 |
| do_not_believe_in_god | 0.0037 |
from sklearn.ensemble import AdaBoostRegressor
model = AdaBoostRegressor(n_estimators=50, random_state=0)
model.fit(x_train, y_train)
print(model.score(x_test, y_test))
output_importances(model.feature_importances_)
model.estimator_weights_
0.13503505585450315
array([0.57892607, 0.65830094, 0.33057098, 0.31857633, 0.30836485,
0.12349047, 0.36056242, 0.27059642, 0.1697619 , 0.32980739,
0.26745244, 0.08591508, 0.26265596, 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ])
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

abr_parameters = {
    # Nested base-estimator parameters use GridSearchCV's double-underscore syntax:
    'estimator__max_depth': [3, 5, 7],
    # 'estimator__min_samples_leaf': [5, 10],
    'n_estimators': [10, 50, 250],
    'learning_rate': [0.01, 0.1]
}
grid_ABR = grid_search(AdaBoostRegressor(estimator=DecisionTreeRegressor()), abr_parameters)
Fitting 5 folds for each of 18 candidates, totalling 90 fits
gbr_parameters = {
    'learning_rate': [0.01, 0.02, 0.03, 0.04],
    'subsample': [0.9, 0.5, 0.2, 0.1],
    'n_estimators': [500],  # 750, 1000, 1500
    'max_depth': [4, 6, 8, 10],
    # Note: GradientBoostingRegressor has no 'estimator' parameter, and every
    # grid value must be a list, so that entry has been removed.
}
grid_GBR = grid_search(GradientBoostingRegressor(), gbr_parameters)
Fitting 5 folds for each of 32 candidates, totalling 160 fits
NOMINATE Predictor Final Model¶
The results of the grid search yielded the following model, which, once fit (fitting takes approximately 1.5 minutes), gives us the following accuracies:
final_model = GradientBoostingRegressor(learning_rate=0.04, max_depth=10,
                                        n_estimators=500, subsample=0.9)
final_model.fit(x_train, y_train)
print("Train:", final_model.score(x_train, y_train))
print("Test :", final_model.score(x_test, y_test))
Train: 0.9902131502094773 Test : 0.7805524891099352
output_importances(final_model.feature_importances_)
Evidently our model is overfitting quite a bit...
from sklearn import metrics

y_pred = final_model.predict(x_test)
y_true = y_test
# Note: sklearn metrics take (y_true, y_pred) in that order:
print("Coefficient of Determination:", metrics.r2_score(y_true, y_pred))
print("MSE:", metrics.mean_squared_error(y_true, y_pred))
print("MAPE:", metrics.mean_absolute_percentage_error(y_true, y_pred))  # sensitive to relative errors
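The MAPE caveat is worth a quick demonstration: because MAPE divides each absolute error by |y_true|, targets near zero (common for moderate NOMINATE scores) inflate it dramatically. A sketch with illustrative numbers:

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Two predictions with the same absolute error (0.05), but one true value
# near zero; the near-zero case contributes a 500% relative error:
y_true = np.array([0.8, 0.01])
y_pred = np.array([0.75, 0.06])

# mean(0.05/0.8, 0.05/0.01) = mean(0.0625, 5.0) = 2.53125
print(mean_absolute_percentage_error(y_true, y_pred))  # -> 2.53125
```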
Feature Importances¶
Sample prediction:¶
df.iloc[3000]
# example prediction:
y_pred = final_model.predict(x_test.iloc[[0]])[0]
y_true = y_test.iloc[0]
print("representative: KING, Peter T.")
print("republican, 106th session of congress")
print("\tpredicted:", y_pred)
print("\ttrue y :", y_true)
Predicting DW-NOMINATE Statewide¶
Predicting Representative Traits Nationally¶
- Age
- Constituent Belief in God
- State/District
- Debts